Multi-Armed Bandits with Discount Factor Near One: The Bernoulli Case
Authors
Abstract
Similar resources
Multi-Armed Bandits with Betting
In this paper we consider an extension of the classic multi-armed bandit problem in which the gambler has, at each round, K coins available for play, and the slot machines accept bets. If the player bets m coins on a machine, then the machine returns m times the payoff of that round. It is important to note that betting m coins on a machine results in obtaining a single sample from the reward distribution of that machine (multiplied ...
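As a minimal illustration of this betting mechanic, a sketch assuming Bernoulli payoffs (the machine means and the coin split below are hypothetical, not taken from the paper):

```python
import random

def pull(p: float, bet: int) -> int:
    """One betting round on a Bernoulli(p) machine: betting `bet` coins
    yields a SINGLE sample from the machine's reward distribution,
    scaled by the bet -- not `bet` independent samples."""
    return bet * (1 if random.random() < p else 0)

# Hypothetical round with K = 4 coins split between two machines.
random.seed(0)
print(pull(0.6, bet=3) + pull(0.2, bet=1))
```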
Contextual Multi-Armed Bandits
We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions ...
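A common baseline for this Lipschitz setting is uniform discretization of the context space with an independent UCB1 index per cell; the sketch below (context space [0, 1], Bernoulli rewards, all parameters hypothetical) illustrates the idea and is not necessarily the algorithm analyzed in the paper.

```python
import math
import random

class LipschitzContextualUCB:
    """Uniform-discretization baseline for contextual bandits on [0, 1]:
    split the context space into cells and run UCB1 per (cell, arm).
    The Lipschitz condition makes nearby contexts behave similarly,
    which is what justifies sharing statistics within a cell."""

    def __init__(self, n_arms: int, bins: int):
        self.n_arms, self.bins, self.t = n_arms, bins, 0
        self.counts = [[0] * n_arms for _ in range(bins)]
        self.means = [[0.0] * n_arms for _ in range(bins)]

    def _cell(self, context: float) -> int:
        return min(int(context * self.bins), self.bins - 1)

    def select(self, context: float) -> int:
        self.t += 1
        c = self._cell(context)
        for a in range(self.n_arms):      # try each arm once per cell
            if self.counts[c][a] == 0:
                return a
        return max(range(self.n_arms),
                   key=lambda a: self.means[c][a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[c][a]))

    def update(self, context: float, arm: int, reward: float) -> None:
        c = self._cell(context)
        self.counts[c][arm] += 1
        self.means[c][arm] += (reward - self.means[c][arm]) / self.counts[c][arm]

# Hypothetical usage: mean payoff varies smoothly (Lipschitz) in the context.
random.seed(0)
bandit = LipschitzContextualUCB(n_arms=3, bins=10)
for _ in range(1000):
    x = random.random()
    a = bandit.select(x)
    r = 1.0 if random.random() < 0.5 + 0.4 * (x - 0.5) * (a - 1) else 0.0
    bandit.update(x, a, r)
```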
Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 1 and 0 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first ...
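To make the setting concrete, here is a sketch of the classical single-target baseline (play an arm until its first failure) for this problem; the paper's two-target rule refines when to keep exploiting an arm, and since the abstract is truncated no attempt is made to reproduce it here.

```python
import random

def one_failure_policy(horizon: int, seed: int = 0) -> float:
    """Infinite-armed Bernoulli bandit baseline: draw a fresh arm with
    mean ~ Uniform[0, 1], play it until its first failure (reward 0),
    then discard it and draw a new arm. Returns the average reward."""
    rng = random.Random(seed)
    total, t = 0, 0
    while t < horizon:
        p = rng.random()             # mean of a freshly drawn arm
        while t < horizon:
            t += 1
            if rng.random() < p:     # success: reward 1, keep the arm
                total += 1
            else:                    # failure: reward 0, move on
                break
    return total / horizon

print(one_failure_policy(100_000))
```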
Staged Multi-armed Bandits
In conventional multi-armed bandits (MAB) and other reinforcement learning methods, the learner sequentially chooses actions and obtains a reward (which may be missing, delayed, or erroneous) after each action taken. This reward is then used by the learner to improve its future decisions. However, in numerous applications, ranging from personalized patient treatment to personalized web-...
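One simple reading of delayed feedback, sketched below: each reward arrives a fixed number of rounds after the pull and only then updates the statistics. This is an assumed toy model for illustration; the staged-MAB setting of the paper is more general.

```python
import math
import random
from collections import deque

def ucb1_with_delay(horizon: int, means, delay: int, seed: int = 0):
    """UCB1 on Bernoulli arms where each reward arrives `delay` rounds
    after the pull; pending rewards sit in a queue until they arrive."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    pending = deque()                 # (arrival_time, arm, reward)
    for t in range(1, horizon + 1):
        while pending and pending[0][0] <= t:   # apply arrived rewards
            _, a, r = pending.popleft()
            counts[a] += 1
            sums[a] += r
        if any(c == 0 for c in counts):
            arm = t % k               # cycle arms until feedback arrives
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pending.append((t + delay, arm, reward))
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

print(ucb1_with_delay(10_000, [0.3, 0.5, 0.7], delay=50))
```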
Mortal Multi-Armed Bandits
We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have a (stochastic) lifetime after which they expire. In this setting, an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified ...
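A naive sketch of why exploration never stops in this model (the exponential lifetimes, uniform arm means, and abandonment threshold below are all hypothetical choices, not the paper's algorithm):

```python
import random

def mortal_greedy(horizon: int, mean_life: float, seed: int = 0) -> float:
    """Mortal-bandit toy model: each arm has mean payoff ~ Uniform[0, 1]
    and an exponentially distributed lifetime; when the current arm
    expires (or looks weak), it is replaced by a fresh one, so new-arm
    exploration recurs forever. Returns the average reward."""
    rng = random.Random(seed)

    def fresh_arm():
        return {"p": rng.random(), "life": rng.expovariate(1 / mean_life)}

    arm, total, pulls, wins = fresh_arm(), 0.0, 0, 0.0
    for _ in range(horizon):
        if arm["life"] <= 0:                    # arm expired: must explore
            arm, pulls, wins = fresh_arm(), 0, 0.0
        r = 1.0 if rng.random() < arm["p"] else 0.0
        total, pulls, wins = total + r, pulls + 1, wins + r
        arm["life"] -= 1
        if pulls >= 10 and wins / pulls < 0.5:  # abandon a weak-looking arm
            arm, pulls, wins = fresh_arm(), 0, 0.0
    return total / horizon

print(mortal_greedy(100_000, mean_life=1000.0))
```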
Journal
Journal title: The Annals of Statistics
Year: 1981
ISSN: 0090-5364
DOI: 10.1214/aos/1176345578